Network visualization using output from text model

Data preparation

Import data. The data is obtained from the voting mechanism after we applied three embedding methods to the definitions of each indicator, where we’ve kept the pair of indicator and its related indicators if the similarity score meets the designed cutoff.

edgelist <- read.csv("~/Documents/GitHub/G5055_Practicum_Project2/Data/Text_Model_Data/edgelist.csv")
edgelist

For future classification of indicators into the goals they belong to, create the nodes dataframe:

nodes <- edgelist %>%
  select(indicator, related_indicator)
nodes <- data.frame(indicatorname = unlist(nodes, use.names = FALSE))
nodes <- distinct(nodes)
nodes$goal <- stri_match_first_regex(nodes$indicator, "(.*?)\\.")[,2]
nodes$goal <-as.numeric(nodes$goal)
nodes

Visualization

In the network graph below, the size of each vertices (each indicator) represents the number of related indicators that are connected to it. The width of the edges linking each indicator is determined according to the similarity score between each pair of related indicators. The indicators are grouped according to the goals they belong to, which are denoted by different colors of the vertices.

g<-graph_from_data_frame(edgelist, directed=FALSE, vertices=nodes)

#Add attributes
E(g)$weight<-E(g)$similarity_score
V(g)$in_degree<-degree(g, mode="in")
colrs<-c("#ea1d2d", "#d19f2a","#2d9a47", "#c22033","#ef412a", "#00add9", "#fdb714", "#8f1838", "#f36e24", "#e01a83", "#f99d25", "#cd8b2a", "#48773c", "#007dbb", "#40ae49",  "#00558a", "#1a3668")
V(g)$color<-colrs[V(g)$goal]

#Plot graph
plot(g, vertex.label=NA, edge.color="gray77", vertex.color=V(g)$color, vertex.size=V(g)$in_degree, edge.width=E(g)$weight*10, layout=layout_nicely(g))

plot(g, vertex.label.color="black", vertex.label.cex=2.5, edge.color="gray77", vertex.color=V(g)$color, vertex.size=V(g)$in_degree, edge.width=E(g)$weight*10, layout=layout_nicely(g))

#legend(x=-11, y=-11, c("Goal 1","Goal 2","Goal 3","Goal 4","Goal 5","Goal 6","Goal 7","Goal 8","Goal 9","Goal 10","Goal 11", "Goal 12","Goal 13", "Goal 14", "Goal 15", "Goal 16", "Goal 17"), pch=20, col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=1)

Key takeaways

By looking at the visualization, some of the goals that have most connections with other goals include Goal 17, Goal 1, Goal 15, Goal 9, etc.

For goals such as Goal 17 and Goal 1, they have large representative vertices (indicators) that have connection with many other indicators.

For goals such as Goal 15 and Goal 9, each of their vertices sizes is relatively small, but the number of indicators under the category that are connected to other indicators is relatively high, so that they are also easy to spot on the graph.

Indicators such as 5.c.1-5.4.1-4.a.1, 8.4.2-12.2.1-8.4.1-12.2.2 have much larger similarity scores than most other indicators by looking at edge width, suggesting a very close connection.

Network visualization using output from the social network model

Data preparation

Data for Indonesia and Guatemala used in this section could be obtained from GitHub as well. Specifically, after downloading indicator data from API, disaggregation measurements are eliminated and the data is processed into pivot format.

For each indicator, the measurements within each indicator, as well as the indicators themselves are deleted if they only contain one year of actual data. If the indicators are eligible to be kept, process them into a separate file with years as rows and measurements of the indicator as columns.

Linear regression method is used for data imputation. Once the datasets without missing value are created, PCA is adopted to reduct dimensionality so that each indicator have one value.

Correlation is then calculated and a social network model is applied. The end result file in the link includes a coefficient between each pair of indicators.

Indonesia

edgelistindo <- read.csv("~/Documents/GitHub/G5055_Practicum_Project2/Data/PCA_results/indo_coefficients.csv")
#Some preprocessing
edgelistindo<-edgelistindo%>%
  select(Var1, Var2, abs)%>%
  filter(Var1!=Var2)
edgelistindo

For future classification of indicators into the goals they belong to, create the nodes dataframe:

indonodes <- edgelistindo %>%
  select(Var1, Var2)
indonodes <- data.frame(indicatorname = unlist(indonodes, use.names = FALSE))
indonodes <- distinct(indonodes)
indonodes$goal <- stri_match_first_regex(indonodes$indicator, "(.*?)\\.")[,2]
indonodes$goal <-as.numeric(indonodes$goal)
indonodes

Guatemala

edgelistguate <- read.csv("~/Documents/GitHub/G5055_Practicum_Project2/Data/PCA_results/gua_coefficients.csv")
#Some preprocessing
edgelistguate<-edgelistguate%>%
  select(Var1, Var2, abs)%>%
  filter(Var1!=Var2)
edgelistguate
guatenodes <- edgelistguate %>%
  select(Var1, Var2)
guatenodes <- data.frame(indicatorname = unlist(guatenodes, use.names = FALSE))
guatenodes <- distinct(guatenodes)
guatenodes$goal <- stri_match_first_regex(guatenodes$indicator, "(.*?)\\.")[,2]
guatenodes$goal <-as.numeric(guatenodes$goal)
guatenodes

Visualization

Indonesia

g2<-graph_from_data_frame(edgelistindo, directed=FALSE, vertices=indonodes)

#Add attributes
E(g2)$weight<-E(g2)$abs
V(g2)$in_degree<-degree(g2, mode="in")
colrs<-c("#ea1d2d", "#d19f2a","#2d9a47", "#c22033","#ef412a", "#00add9", "#fdb714", "#8f1838", "#f36e24", "#e01a83", "#f99d25", "#cd8b2a", "#48773c", "#007dbb", "#40ae49",  "#00558a", "#1a3668")
V(g2)$color<-colrs[V(g2)$goal]

#Plot graph
plot(g2, vertex.label=NA, edge.color="gray77", vertex.color=V(g2)$color, vertex.size=V(g2)$in_degree/10, edge.width=E(g2)$weight, layout=layout_nicely(g2))

plot(g2, vertex.label.color="black", vertex.label.cex=2.5, edge.color="gray77", vertex.color=V(g2)$color, vertex.size=V(g2)$in_degree/10, edge.width=E(g2)$weight, layout=layout_nicely(g2))

Guatemala

g3<-graph_from_data_frame(edgelistguate, directed=FALSE, vertices=guatenodes)

#Add attributes
E(g3)$weight<-E(g3)$abs
V(g3)$in_degree<-degree(g3, mode="in")
colrs<-c("#ea1d2d", "#d19f2a","#2d9a47", "#c22033","#ef412a", "#00add9", "#fdb714", "#8f1838", "#f36e24", "#e01a83", "#f99d25", "#cd8b2a", "#48773c", "#007dbb", "#40ae49",  "#00558a", "#1a3668")
V(g3)$color<-colrs[V(g3)$goal]

#Plot graph
plot(g3, vertex.label=NA, edge.color="gray77", vertex.color=V(g3)$color, vertex.size=V(g3)$in_degree/10, edge.width=E(g3)$weight, layout=layout_nicely(g3))

plot(g3, vertex.label.color="black", vertex.label.cex=2.5, edge.color="gray77", vertex.color=V(g3)$color, vertex.size=V(g3)$in_degree/10, edge.width=E(g3)$weight, layout=layout_nicely(g3))

Key takeaways

By looking at the edges of the visualization for the social network model, the connections among indicators seem to be more intense than those reflected by the network based on text model.

The goals that have the most connections remain similar with the results from the text model, (Goal 17 for instance). The size of each vertex seems similar, suggesting that the intensity of each indicator’s connection with other indicators is much weaker in difference than reflected in the definitions of the indicators (text model).

In terms of difference between countries, the list of goals included and are eminent looks similar. For instance, in terms of the strength of connections, the indicators that have relatively weaker connections (smaller number of links) are mostly similar (Goal 1,2,4, 11, 12, 17), though with slight differences.